tidyr
1. Welcome & Q’s on homework
2. Part 1: Data Quality and Descriptives
3. Part 2: tidyr
4. Problem set & Q time
IBM’s definition of data quality:
“Data quality measures how well a dataset meets criteria for accuracy, completeness, validity, consistency, uniqueness, timeliness, and fitness for purpose”
The Towards Data Science website has a nice definition of EDA:
“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics”
I think of this as “exploratory data analysis for data quality”
Will teach exploratory data analysis (EDA) in two sub-sections:
Assume that your goal in “EDA for data quality” is to investigate “input” data sources and create “analysis variables”
EDA is not a linear process, and the process will vary across people and projects. Some broad steps:
It is critically important to step through EDA processes at multiple points during data cleaning, from the input / raw data to the output / analysis / clean data.
Always be aware of missing values
They will not always be coded as NA in input variables (e.g., some projects code them as 99, 99999, negative values, etc.)
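A minimal sketch of recoding such sentinel values to NA before computing descriptives; the data frame, variables, and missing-value codes here are hypothetical:

```r
library(dplyr)

# Hypothetical input where missingness was coded as 99 or negative values
df <- tibble(
  SID  = c("001", "002", "003"),
  age  = c(16, 99, 21),   # 99 = missing (hypothetical coding)
  mood = c(3, -1, 5)      # negative = missing (hypothetical coding)
)

df_clean <- df %>%
  mutate(
    age  = na_if(age, 99),                        # one specific sentinel value
    mood = if_else(mood < 0, NA_real_, mood)      # a whole class of sentinel values
  )
```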
“Unit of analysis” refers to “what does each observation represent” in an input data source
How to identify the unit of analysis:
- data documentation
- investigating the data set
This is very important because we often conduct analyses that span multiple units of analysis (e.g., between- v within-person, person- v stimuli-level, etc.)
We have to be careful and thoughtful about identifiers that let us do that (important for joining data together, which will be the focus of our R workshop today)
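For example, moving between person-level and trial-level units of analysis usually means carrying a shared identifier across data sets. A minimal sketch (toy data; all names are hypothetical):

```r
library(dplyr)

# Person-level data: one row per person
person <- tibble(SID = c("001", "002"), age = c(20L, 25L))

# Trial-level data: one row per person x trial; SID links the two units
trial <- tibble(
  SID       = c("001", "001", "002"),
  trial_num = c(1L, 2L, 1L),
  rt        = c(412, 398, 455)
)

# Bring person-level variables onto the trial-level rows via the identifier
trial_full <- left_join(trial, person, by = "SID")
```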
Rules I follow for variable creation
Data Wrangling in tidyr
tidyr:
- pivot_longer(), which takes a “wide” format data frame and makes it long.
- pivot_wider(), which takes a “long” format data frame and makes it wide.

dplyr:
- full_join(), which merges all rows in either data frame
- inner_join(), which merges rows whose keys are present in both data frames
- left_join(), which “prioritizes” the first data set
- right_join(), which “prioritizes” the second data set

(See also: anti_join() and semi_join())
tidyr Functions

pivot_longer()
(supersedes gather()) Makes wide data long, based on a key
- data: the data, blank if piped
- cols: columns to be made long, selected via select() calls
- names_to: name(s) of key column(s) in new long data frame (string or string vector)
- values_to: name of values in new long data frame (string)
- names_sep: separator in column headers, if multiple keys
- values_drop_na: drop missing cells (similar to na.rm = TRUE)

pivot_longer(): Basic Application

Let’s start with an easy one – one key, one value:
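A call of roughly this form could produce output of the shape below; the toy bfi_wide stand-in here (a few persons and items) is an assumption:

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the wide bfi data: one row per person, item columns
bfi_wide <- tibble(
  SID = c("61617", "61618"), gender = c(1L, 2L), age = c(16L, 18L),
  A1 = c(2L, 2L), A2 = c(4L, 4L), C1 = c(2L, 5L)
)

# One key ("item"), one value column ("values")
bfi_long <- bfi_wide %>%
  pivot_longer(
    cols = A1:C1,            # the item columns to stack
    names_to = "item",
    values_to = "values",
    values_drop_na = TRUE    # drop rows whose item response is NA
  )
```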
# A tibble: 69,492 × 6
SID gender education age item values
<chr> <int> <int> <int> <chr> <int>
1 61617 1 NA 16 A1 2
2 61617 1 NA 16 A2 4
3 61617 1 NA 16 A3 3
4 61617 1 NA 16 A4 4
5 61617 1 NA 16 A5 4
6 61617 1 NA 16 C1 2
7 61617 1 NA 16 C2 3
8 61617 1 NA 16 C3 3
# ℹ 69,484 more rows
pivot_longer(): More Advanced Application

Now a harder one – two keys, one value:
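Because item names like “A1” have no separator character, one way to split them into two keys is names_pattern (a regex with one group per key) rather than names_sep. A sketch with toy data:

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the wide bfi data
bfi_wide <- tibble(
  SID = c("61617", "61618"),
  A1 = c(2L, 2L), A2 = c(4L, 4L), C1 = c(2L, 5L)
)

# Two keys: split "A1" into trait ("A") and item number ("1")
bfi_long2 <- bfi_wide %>%
  pivot_longer(
    cols = A1:C1,
    names_to = c("trait", "item_num"),
    names_pattern = "([A-Za-z])([0-9])",  # regex groups: letter, then digit
    values_to = "values"
  )
```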
# A tibble: 69,492 × 7
SID gender education age trait item_num values
<chr> <int> <int> <int> <chr> <chr> <int>
1 61617 1 NA 16 A 1 2
2 61617 1 NA 16 A 2 4
3 61617 1 NA 16 A 3 3
4 61617 1 NA 16 A 4 4
5 61617 1 NA 16 A 5 4
6 61617 1 NA 16 C 1 2
7 61617 1 NA 16 C 2 3
8 61617 1 NA 16 C 3 3
# ℹ 69,484 more rows
pivot_wider()
(supersedes spread()) Makes long data wide, based on a key
- data: the data, blank if piped
- names_from: name(s) of key column(s) whose values become column names in the new wide data frame (string or string vector)
- names_sep: separator in column headers, if multiple keys
- names_glue: specify multiple or custom separators for multiple keys
- values_from: name of the column holding the values (string)
- values_fn: function applied to data with duplicate labels

pivot_wider(): Basic Application

# A tibble: 2,800 × 29
SID gender education age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 1 NA 16 2 4 3 4 4 2 3 3
2 61618 2 NA 18 2 4 5 2 5 5 4 4
3 61620 2 NA 17 5 4 5 4 4 4 5 4
4 61621 2 NA 17 4 4 6 5 5 4 4 3
5 61622 1 NA 17 2 3 3 4 5 4 4 5
6 61623 2 3 21 6 6 5 6 5 6 6 6
7 61624 1 NA 18 2 5 5 3 5 5 4 4
8 61629 1 2 19 4 3 1 5 1 3 2 4
9 61630 1 1 19 4 3 6 3 3 6 6 3
10 61633 2 NA 17 2 5 6 6 5 6 5 6
# ℹ 2,790 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
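The wide output above could come from a call of roughly this shape; the toy stand-in for the long data is an assumption:

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the long bfi data: one row per person x item
bfi_long <- tibble(
  SID    = rep(c("61617", "61618"), each = 2),
  item   = rep(c("A1", "A2"), times = 2),
  values = c(2L, 4L, 2L, 4L)
)

# Item labels become column names; cells are filled from "values"
bfi_wide <- bfi_long %>%
  pivot_wider(
    names_from = "item",
    values_from = "values"
  )
```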
pivot_wider(): More Advanced

bfi_long %>%
pivot_wider(
names_from = c("trait", "item_num")
, values_from = "values"
, names_sep = "_"
  )

# A tibble: 2,800 × 29
SID gender education age A_1 A_2 A_3 A_4 A_5 C_1 C_2 C_3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 1 NA 16 2 4 3 4 4 2 3 3
2 61618 2 NA 18 2 4 5 2 5 5 4 4
3 61620 2 NA 17 5 4 5 4 4 4 5 4
4 61621 2 NA 17 4 4 6 5 5 4 4 3
5 61622 1 NA 17 2 3 3 4 5 4 4 5
6 61623 2 3 21 6 6 5 6 5 6 6 6
7 61624 1 NA 18 2 5 5 3 5 5 4 4
8 61629 1 2 19 4 3 1 5 1 3 2 4
9 61630 1 1 19 4 3 6 3 3 6 6 3
10 61633 2 NA 17 2 5 6 6 5 6 5 6
# ℹ 2,790 more rows
# ℹ 17 more variables: C_4 <int>, C_5 <int>, E_1 <int>, E_2 <int>, E_3 <int>,
# E_4 <int>, E_5 <int>, N_1 <int>, N_2 <int>, N_3 <int>, N_4 <int>,
# N_5 <int>, O_1 <int>, O_2 <int>, O_3 <int>, O_4 <int>, O_5 <int>
pivot_wider(): A Little More Advanced

bfi_long %>%
select(-item_num) %>%
pivot_wider(
names_from = "trait"
, values_from = "values"
, names_sep = "_"
, values_fn = mean
  )

# A tibble: 2,800 × 9
SID gender education age A C E N O
<chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 61617 1 NA 16 3.4 3.2 3.4 2.8 3.8
2 61618 2 NA 18 3.6 4 3 3.8 3.2
3 61620 2 NA 17 4.4 4 3.8 3.6 3.6
4 61621 2 NA 17 4.8 4.2 4 2.8 3.6
5 61622 1 NA 17 3.4 3.6 3.6 3.2 3.2
6 61623 2 3 21 5.6 4.4 4 3 3.8
7 61624 1 NA 18 4 3.6 4.2 1.4 3.8
8 61629 1 2 19 2.8 3 3.2 4.2 3.4
9 61630 1 1 19 3.8 4.8 3.75 3.6 5
10 61633 2 NA 17 4.8 4 3.6 4.2 3.6
# ℹ 2,790 more rows
dplyr Functions

_join() Functions
- full_join()
- inner_join()
- left_join()
- right_join()

bfi_only <- bfi %>%
rownames_to_column("SID") %>%
select(SID, matches("[0-9]"))
bfi_only %>% as_tibble() %>% print(n = 6)

# A tibble: 2,800 × 26
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 13 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>
bfi_dem <- bfi %>%
rownames_to_column("SID") %>%
select(SID, education, gender, age)
bfi_dem %>% as_tibble() %>% print(n = 6)

# A tibble: 2,800 × 4
SID education gender age
<chr> <int> <int> <int>
1 61617 NA 1 16
2 61618 NA 2 18
3 61620 NA 2 17
4 61621 NA 2 17
5 61622 NA 1 17
6 61623 3 2 21
# ℹ 2,794 more rows
full_join()
Most simply, we can put those back together keeping all observations.
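A sketch of such a call, with toy stand-ins for bfi_only and bfi_dem (in a full join, unmatched rows from either side are kept and filled with NA):

```r
library(dplyr)

# Toy stand-ins: items for two people, demographics for two people,
# with only one SID in common
bfi_only <- tibble(SID = c("61617", "61618"), A1 = c(2L, 2L))
bfi_dem  <- tibble(SID = c("61618", "61620"), age = c(18L, 17L))

# Keep every row from both data frames; unmatched cells become NA
joined <- full_join(bfi_only, bfi_dem, by = "SID")
```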
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, education <int>, gender <int>, age <int>
# A tibble: 2,800 × 29
SID A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61617 2 4 3 4 4 2 3 3 4 4 3 3
2 61618 2 4 5 2 5 5 4 4 3 4 1 1
3 61620 5 4 5 4 4 4 5 4 2 5 2 4
4 61621 4 4 6 5 5 4 4 3 5 5 5 3
5 61622 2 3 3 4 5 4 4 5 3 2 2 2
6 61623 6 6 5 6 5 6 6 6 1 3 2 1
# ℹ 2,794 more rows
# ℹ 16 more variables: E3 <int>, E4 <int>, E5 <int>, N1 <int>, N2 <int>,
# N3 <int>, N4 <int>, N5 <int>, O1 <int>, O2 <int>, O3 <int>, O4 <int>,
# O5 <int>, gender <int>, education <int>, age <int>
inner_join()
We can also keep only the rows whose keys are present in both data frames
bfi_dem %>%
filter(row_number() %in% 1:1700) %>%
inner_join(
bfi_only %>%
filter(row_number() %in% 1200:2800)
) %>%
as_tibble() %>%
  print(n = 6)

# A tibble: 501 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 64151 3 2 18 1 5 6 5 5 5 6 5
2 64152 4 2 29 1 5 6 5 5 2 1 4
3 64154 5 1 46 2 5 6 5 6 6 6 6
4 64155 5 1 58 5 4 4 4 5 4 4 5
5 64156 5 2 38 1 4 6 6 6 4 4 5
6 64158 5 2 27 2 3 1 1 1 4 2 2
# ℹ 495 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
left_join()
Or keep all rows present in the left (first) data frame, for example when it is a subset of people with complete data
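A sketch with toy stand-ins, assuming the left data frame has already been reduced to the rows of interest (unmatched rows from the right side are dropped):

```r
library(dplyr)

# Toy stand-ins: demographics for a subset, items for a different subset
bfi_dem  <- tibble(SID = c("61623", "61629"), education = c(3L, 2L))
bfi_only <- tibble(SID = c("61617", "61623"), A1 = c(2L, 6L))

# All rows from the left (first) data frame are kept;
# right-side rows with no matching SID are dropped
joined <- left_join(bfi_dem, bfi_only, by = "SID")
```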
# A tibble: 2,577 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623 3 2 21 6 6 5 6 5 6 6 6
2 61629 2 1 19 4 3 1 5 1 3 2 4
3 61630 1 1 19 4 3 6 3 3 6 6 3
4 61634 1 1 21 4 4 5 6 5 4 3 5
5 61640 1 1 17 4 5 2 2 1 5 5 5
6 61661 5 1 68 1 5 6 5 6 4 3 2
# ℹ 2,571 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
right_join()
Or keep all rows present in the right (second) data frame, as I do when joining a codebook with raw data
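A sketch with toy stand-ins (the mirror image of left_join: all rows from the second data frame survive):

```r
library(dplyr)

# Toy stand-ins: demographics for one person, items for two
bfi_dem  <- tibble(SID = "61623", education = 3L)
bfi_only <- tibble(SID = c("61617", "61623"), A1 = c(2L, 6L))

# All rows from the right (second) data frame are kept;
# left-side columns are NA where no SID matches
joined <- right_join(bfi_dem, bfi_only, by = "SID")
```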
# A tibble: 2,800 × 29
SID education gender age A1 A2 A3 A4 A5 C1 C2 C3
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 61623 3 2 21 6 6 5 6 5 6 6 6
2 61629 2 1 19 4 3 1 5 1 3 2 4
3 61630 1 1 19 4 3 6 3 3 6 6 3
4 61634 1 1 21 4 4 5 6 5 4 3 5
5 61640 1 1 17 4 5 2 2 1 5 5 5
6 61661 5 1 68 1 5 6 5 6 4 3 2
# ℹ 2,794 more rows
# ℹ 17 more variables: C4 <int>, C5 <int>, E1 <int>, E2 <int>, E3 <int>,
# E4 <int>, E5 <int>, N1 <int>, N2 <int>, N3 <int>, N4 <int>, N5 <int>,
# O1 <int>, O2 <int>, O3 <int>, O4 <int>, O5 <int>
Part 1 of these slides was adapted from PSC 290 - Data Management and Cleaning